Question 1: Download the data, load it into python. Report the number of rows and columns that you've loaded.
Answer: The data has more than 1.5 M records (1510722) and 21 columns.
import os
import pandas as pd
from IPython.display import clear_output
def load_data(month='2016-02'):
    file_path = f'green_tripdata_{month}.pickle'
    if os.path.isfile(file_path):
        df = pd.read_pickle(file_path)
    else:
        url = f'https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_{month}.csv'
        df = pd.read_csv(url)
        df.to_pickle(file_path)
    print(df.shape)
    return df
df = load_data()
(1510722, 21)
Question 2: Visualize trip distance by time of day in any way you see fit, any observations?
Answer:
Most of these trips are below 4 km, and half of them are very short (<2 km).
Trip distance has high variance at 5 am and 6 am, while at 5 pm and 6 pm the distribution is concentrated around short distances.
Trips starting between 5 am and 6 am are more likely to be long-distance (>10 km) than trips at other times.
A plausible explanation is that people travel to airports or train stations in the early mornings. If so, major stations should appear among the most popular drop-off locations of these early-morning trips.
The highest peak of trip distance (summed by hour) happens in the evening after work between 6 pm and 7 pm (sum: dark purple line), and the median travel distance (median: rosy pink line) shows that they are very likely to be very short trips (<1.8 km).
The second highest peak is between 8 am and 9 am.
From midnight to 7 am the trip distance (summed by hour) is obviously lower than the rest of the day.
From 'Plot 2-2B: sum distance and count of trips by hour' we can see that the average distance per trip per hour stays roughly constant across the day.
Because we have not separated weekends from workdays, workdays clearly dominate the pattern of the sum of distance by hour. I can imagine a typical urban lifestyle: around 8 or 9 am, people running late may take a taxi to the office. After work they go to bars and restaurants by cab, then head home; after midnight most people are not traveling, even in New York.
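The observation about Plot 2-2B can be checked directly: the average distance per trip in each hour is just the hourly sum divided by the hourly count. A minimal sketch, using a small synthetic stand-in for `df` with the same `lpep_pickup_datetime` and `Trip_distance` columns (the numbers here are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the loaded taxi data: a few trips across two hours.
df = pd.DataFrame({
    "lpep_pickup_datetime": [
        "2016-02-01 05:10", "2016-02-01 05:40",
        "2016-02-01 18:05", "2016-02-01 18:20", "2016-02-01 18:50",
    ],
    "Trip_distance": [12.0, 8.0, 1.5, 2.0, 1.0],
})

hours = pd.to_datetime(df["lpep_pickup_datetime"]).dt.hour
by_hour = df.groupby(hours)["Trip_distance"].agg(["sum", "count"])
# Mean distance per trip per hour = hourly sum / hourly count.
by_hour["mean_per_trip"] = by_hour["sum"] / by_hour["count"]
print(by_hour)
```

On the real data, a flat `mean_per_trip` curve across hours is what "relatively the same" means here.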
# Plot 2-1
import matplotlib.pyplot as plt
dist_col = 'Trip_distance'
time_col = 'lpep_pickup_datetime'
times = pd.to_datetime(df[time_col])
tmp_med = df.groupby([times.dt.hour])[dist_col].median().to_frame(name = 'Median_dist_by_hour')
tmp_sum = df.groupby([times.dt.hour])[dist_col].sum().to_frame(name = 'Sum_dist_by_hour')
tmp_count = df.groupby([times.dt.hour])[dist_col].count().to_frame(name = 'Count_trips_by_hour')
fig, axes = plt.subplots(1, 1, figsize=(8, 4), dpi=100, sharex=False)
plot_df = tmp_sum.merge(tmp_med, left_index = True, right_index = True).merge(tmp_count, left_index = True, right_index = True)
plot_df['rank_sum_dist'] = plot_df['Sum_dist_by_hour'].rank(ascending = False)
plot_df['rank_med_dist'] = plot_df['Median_dist_by_hour'].rank(ascending = False)
plot_df = plot_df.reset_index()
grouped = df[[dist_col,time_col]].groupby(pd.to_datetime(df[time_col]).dt.hour)
grouped.boxplot(rot=45, fontsize=12,subplots=False, showfliers=False, ax = axes)
axes.set_xticks(ticks = list(range(1, 25)))
axes.set_xticklabels([f'{i}H' for i in range(24)])
plt.title('Plot 2-1: boxplot of trip distance per hour')
plt.show()
From the plot below we can see that the highest peak of total trip distance occurs between 5 pm and 8 pm.
The second highest peak is between 8 am and 9 am.
From midnight to 7 am the sum of trip distance is clearly lower than during the rest of the day.
Because we have not separated weekends from workdays, workdays clearly dominate the pattern here.
# Plot 2-2
x = plot_df.index
y1 = plot_df.Sum_dist_by_hour
y2 = plot_df.Median_dist_by_hour
y3 = plot_df.Count_trips_by_hour
fig, ax1 = plt.subplots(1,1, figsize=(8, 3))
ax2 = ax1.twinx()
ax1.plot(x, y1, "#6C5B7B")
ax2.plot(x, y2, "#C06C84")
ax1.set_xlabel('pickup time 24 H (group by hour)')
ax1.set_xticks(ticks = list(range(24)))
ax1.set_xticklabels([f'{i}' for i in range(24)])
ax1.set_ylabel('Sum of trip distance per hour', color="#6C5B7B")
ax2.set_ylabel('Median of trip distance per hour', color="#C06C84")
plt.title('Plot 2-2A: sum and median distance by hour')
plt.show()
fig, ax1 = plt.subplots(1,1, figsize=(8, 3))
ax2 = ax1.twinx()
ax1.plot(x, y1, "#6C5B7B")
ax2.plot(x, y3, "#F8B195")
ax1.set_xlabel('pickup time 24 H (group by hour)')
ax1.set_xticks(ticks = list(range(24)))
ax1.set_xticklabels([f'{i}' for i in range(24)])
ax1.set_ylabel('Sum of trip distance per hour', color="#6C5B7B")
ax2.set_ylabel('Count of trip distance per hour', color="#F8B195")
plt.title('Plot 2-2B: sum distance and count of trips by hour')
plt.show()
From Plot 2-3B below we can see a clear, simple pattern in the median distance per trip per hour:
between 5 am and 6 am the median trip distance is significantly longer than during the rest of the day.
# Plot 2-3A
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.lines import Line2D
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from textwrap import wrap
df_sorted = plot_df
ANGLES = np.linspace(0, 2 * np.pi, len(df_sorted), endpoint=False)
SUM_DIST= df_sorted["Sum_dist_by_hour"].values
MEDIAN_DIST = df_sorted["Median_dist_by_hour"].values
TIME = df_sorted["lpep_pickup_datetime"].values
RANK_SUM = df_sorted["rank_sum_dist"].values
GREY12 = "#1f1f1f"
plt.rcParams.update({"font.family": "DejaVu Sans"})
plt.rcParams["text.color"] = GREY12
plt.rc("axes", unicode_minus=False)
COLORS = ["#6C5B7B","#C06C84","#F67280","#F8B195"]
cmap = mpl.colors.LinearSegmentedColormap.from_list("my color", COLORS, N=256)
norm = mpl.colors.Normalize(vmin=RANK_SUM.min(), vmax=RANK_SUM.max())
COLORS = cmap(norm(RANK_SUM))
fig, ax = plt.subplots(figsize=(5, 5), subplot_kw={"projection": "polar"})
fig.patch.set_facecolor("white")
ax.set_facecolor("white")
plt.setp(ax.get_yticklabels(), visible=False)
ax.set_xticks(np.linspace(0, 2*np.pi, 24, endpoint=False))
ax.set_xticklabels(range(24))
ax.set_theta_direction(-1)
ax.set_theta_offset(np.deg2rad(90))
ax.bar(ANGLES, SUM_DIST, color=COLORS, alpha=0.9, width=0.2, zorder=0)
ax.set_title('Plot 2-3A: Sum of trip distance per hour')
plt.show()
RANK_MED = df_sorted["rank_med_dist"].values
norm = mpl.colors.Normalize(vmin=RANK_MED.min(), vmax=RANK_MED.max())
COLORS = cmap(norm(RANK_MED))
fig, ax = plt.subplots(figsize=(5, 5), subplot_kw={"projection": "polar"})
fig.patch.set_facecolor("white")
ax.set_facecolor("white")
plt.setp(ax.get_yticklabels(), visible=False)
ax.set_xticks(np.linspace(0, 2*np.pi, 24, endpoint=False))
ax.set_xticklabels(range(24))
ax.set_theta_direction(-1)
ax.set_theta_offset(np.deg2rad(90))
ax.bar(ANGLES, MEDIAN_DIST, color=COLORS, alpha=0.9, width=0.2, zorder=0)
ax.set_title('Plot 2-3B: Median of trip distance per hour')
plt.show()
Question 3: What are the most popular pickup locations on weekends vs weekdays?
Answer:
The most popular pickup location is the same for both weekends and weekdays: Forest Hills Station.
The top 3 locations for weekdays are roughly: (-73.845, 40.722), (-73.938, 40.804) and (-73.891, 40.748).
The top 3 locations for weekends are roughly: (-73.845, 40.722), (-73.891, 40.748) and (-73.957, 40.722).
This result shows that the popular pick-up locations are not very different between weekdays and weekends. Train stations in Queens are always the busiest among the three central boroughs (Manhattan, Queens and Brooklyn). The only difference might be: on weekdays Marcus Garvey Park in Manhattan is more active, while on weekends it is beaten by McCarren Park in Brooklyn.
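One caveat with grouping on raw coordinates, as in the tables below, is that GPS jitter splits a single physical location (e.g. Forest Hills Station) across many nearly identical rows. A hedged sketch of one simple fix, rounding coordinates to about three decimal places (~100 m) before counting, shown here on a few synthetic points:

```python
import pandas as pd

# Synthetic GPS fixes: three nearly identical points at one station, one elsewhere.
pickups = pd.DataFrame({
    "Pickup_longitude": [-73.844292, -73.844276, -73.844299, -73.936958],
    "Pickup_latitude":  [40.721352, 40.721336, 40.721352, 40.764774],
})

# Round to 3 decimals (~100 m) so near-duplicate fixes collapse into one cell.
rounded = pickups.round(3)
counts = (rounded.groupby(["Pickup_longitude", "Pickup_latitude"])
                 .size().sort_values(ascending=False))
print(counts)
```

The `top_n_location` helper further down achieves the same effect with `np.histogram2d` binning instead of rounding.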
# weekdays
df['dayofweek'] = pd.to_datetime(df['lpep_pickup_datetime']).dt.dayofweek
df['weekend'] = df['dayofweek'] > 4
location_visual_df = df[['Pickup_longitude', 'Pickup_latitude', 'weekend']].copy()
location_visual_df.loc[~location_visual_df.weekend].groupby(['Pickup_longitude', 'Pickup_latitude']).count().sort_values(by = 'weekend', ascending = False).head(5)
| Pickup_longitude | Pickup_latitude | count |
|---|---|---|
| 0.000000 | 0.000000 | 1818 |
| -73.844292 | 40.721352 | 45 |
| -73.936958 | 40.764774 | 40 |
| -73.844276 | 40.721336 | 38 |
| -73.844299 | 40.721352 | 37 |
(The (0, 0) rows are invalid GPS fixes; the quantile filtering below removes them.)
# weekend
df['dayofweek'] = pd.to_datetime(df['lpep_pickup_datetime']).dt.dayofweek
df['weekend'] = df['dayofweek']>4
location_visual_df = df[['Pickup_longitude', 'Pickup_latitude','weekend']].copy()
location_visual_df.loc[location_visual_df.weekend].groupby(['Pickup_longitude', 'Pickup_latitude']).count().sort_values(by = 'weekend', ascending = False).head(5)
| Pickup_longitude | Pickup_latitude | count |
|---|---|---|
| 0.000000 | 0.000000 | 749 |
| -73.844299 | 40.721352 | 17 |
| -73.844276 | 40.721329 | 16 |
| -73.844299 | 40.721367 | 15 |
| -73.844269 | 40.721329 | 14 |
# https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.dayofweek.html
# Monday = 0, Sunday = 6
from datetime import datetime
df['dayofweek'] = pd.to_datetime(df['lpep_pickup_datetime']).dt.dayofweek
df['weekend'] = df['dayofweek']>4
location_visual_df = df[['Pickup_longitude', 'Pickup_latitude','weekend']].copy()
for c in ['Pickup_longitude', 'Pickup_latitude']:
    low, high = df[c].quantile([0.01, 0.99])
    location_visual_df = location_visual_df.query('{low}<{c}<{high}'.format(low=low, c=c, high=high))
fig, ax = plt.subplots(figsize=(4,4))
h = plt.hist2d(location_visual_df.loc[~location_visual_df.weekend,'Pickup_longitude'], location_visual_df.loc[~location_visual_df.weekend,'Pickup_latitude'], bins=100)
fig.colorbar(h[3], ax=ax)
plt.title('Figure 3-2A Weekdays heatmap')
plt.show()
fig, ax = plt.subplots(figsize=(4,4))
h = plt.hist2d(location_visual_df.loc[location_visual_df.weekend,'Pickup_longitude'], location_visual_df.loc[location_visual_df.weekend,'Pickup_latitude'], bins=100)
fig.colorbar(h[3], ax=ax)
plt.title('Figure 3-2B Weekend heatmap')
plt.show()
def top_n_location(df_sub, n):
    df_sub_cluster = np.histogram2d(df_sub.loc[:, 'Pickup_longitude'], df_sub.loc[:, 'Pickup_latitude'], bins=100)
    counts = df_sub_cluster[0]
    lons = []
    lats = []
    for i in range(n):
        top_n_counts = np.sort(counts, axis=None)[-i-1]
        lon_bin_id, lat_bin_id = np.where(counts == top_n_counts)
        # take the center of the bin as the representative coordinate
        lon = (df_sub_cluster[1][lon_bin_id[0]] + df_sub_cluster[1][lon_bin_id[0]+1]) / 2
        lat = (df_sub_cluster[2][lat_bin_id[0]] + df_sub_cluster[2][lat_bin_id[0]+1]) / 2
        print('The {} popular location has count {} is around ({}, {})'.format(str(i+1), top_n_counts, lon, lat))
        lons.append(lon)
        lats.append(lat)
    return lons, lats
weekday_top_n_lons, weekday_top_n_lats = top_n_location(location_visual_df.loc[~location_visual_df.weekend], 3)
The 1 popular location has count 17075.0 is around (-73.84473026275634, 40.7220411682129)
The 2 popular location has count 12328.0 is around (-73.93827632904052, 40.80425399780273)
The 3 popular location has count 11339.0 is around (-73.89054874420165, 40.74787948608399)
weekend_top_n_lons, weekend_top_n_lats = top_n_location(location_visual_df.loc[location_visual_df.weekend], 3)
The 1 popular location has count 6444.0 is around (-73.84473026275634, 40.72204586029052)
The 2 popular location has count 5942.0 is around (-73.89054874420165, 40.74788333892822)
The 3 popular location has count 5680.0 is around (-73.95736736297607, 40.72204586029052)
import matplotlib.pyplot as plt
import numpy as np
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
import io
from urllib.request import urlopen, Request
from PIL import Image
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
def image_spoof(self, tile):
    url = self._image_url(tile)
    req = Request(url)
    req.add_header('User-agent', 'Anaconda 3')
    fh = urlopen(req)
    im_data = io.BytesIO(fh.read())
    fh.close()
    img = Image.open(im_data)
    img = img.convert(self.desired_tile_form)
    return img, self.tileextent(tile), 'lower'
cimgt.OSM.get_image = image_spoof
osm_img = cimgt.OSM()
fig = plt.figure(figsize=(8,6))
ax1 = plt.axes(projection=osm_img.crs)
x0 = -74.0
y0 = 40.5
x1 = -73.8
y1 =41
center_pt = [(y0+y1)/2,(x0+x1)/2]
zoom = 0.1
extent = [center_pt[1]-(zoom*2.0),center_pt[1]+(zoom*2.0),center_pt[0]-(zoom*1.8),center_pt[0]+(zoom*1.8)]
ax1.set_extent(extent)
ax1.set_xticks(np.linspace(extent[0],extent[1],7),crs=ccrs.PlateCarree())
ax1.set_yticks(np.linspace(extent[2],extent[3],7)[1:],crs=ccrs.PlateCarree())
lon_formatter = LongitudeFormatter(number_format='0.2f',degree_symbol='',dateline_direction_label=True)
lat_formatter = LatitudeFormatter(number_format='0.2f',degree_symbol='')
ax1.xaxis.set_major_formatter(lon_formatter)
ax1.yaxis.set_major_formatter(lat_formatter)
ax1.xaxis.set_tick_params(labelsize=10)
ax1.yaxis.set_tick_params(labelsize=10)
scale = np.ceil(-np.sqrt(2) * np.log(np.divide(zoom, 350.0)))  # heuristic OSM zoom level
scale = min(int(scale), 19)  # OSM tiles only go up to zoom level 19
ax1.add_image(osm_img, scale)
lons, lats, rank = weekday_top_n_lons, weekday_top_n_lats, [1, 2, 3]
ax1.plot(lons, lats, markersize=10, marker='o',linestyle='',color='green',transform=ccrs.PlateCarree(), alpha = 0.7, label = 'weekday')
lons, lats, rank = weekend_top_n_lons, weekend_top_n_lats, [1, 2, 3]
ax1.plot(lons, lats, markersize=5, marker='o',linestyle='',color='red',transform=ccrs.PlateCarree(), alpha = 0.7, label = 'weekend')
plt.legend()
plt.title('Figure 3-2C Most popular pickup locations on weekdays vs weekends')
plt.show()
Question 4: Build a model to forecast the number of trips by hour for the next 12 hours after Feb 12th 10:00 am. How well did you do?
Answer: The best model was trained on a combination of data from 2015-02 and data from 2016-02-01 to 2016-02-12 10 am; it achieved a mean absolute error of 204.16 on the test data (Feb 12 11 am - Feb 12 10 pm).
Problem description
This is a typical univariate time series forecasting problem, and there are many existing solutions for it. The first that comes to mind is the Seasonal AutoRegressive Integrated Moving Average (SARIMA) model, which has been used widely for time series forecasting for decades. Machine learning models like LSTMs and causal ConvNets are also popular for complex or high-dimensional time series analysis, but 12 days of a single hourly variable is too little data to train a neural network. Besides, a SARIMA model's output is much easier to understand and evaluate.
Section 4.1: Differencing and stationarity
Before building any models, I checked the stationarity and seasonal trends of the number of trips by hour over the whole of Feb 2016. 'Plot 4-1: Num of trips by hour' gives a good sense of what the time series looks like: the original series is not stationary at all. The Augmented Dickey-Fuller (ADF) test is a unit-root test for stationarity. The results show that after first-order differencing, the new series becomes stationary and the null hypothesis (there is a unit root in the univariate process) is rejected with p-value < 0.000001.
Section 4.2.1: ACF and PACF
The autocorrelation function (ACF) and partial autocorrelation function (PACF) help us understand the temporal dynamics of an individual time series. The ACF correlogram is typically used to estimate the moving average (MA) order, and the PACF correlogram the autoregressive (AR) order. Combining MA and AR with differencing gives the AutoRegressive Integrated Moving Average (ARIMA) model; taking seasonality into account gives the Seasonal ARIMA (SARIMA) model. From the correlograms below I roughly selected a range for each SARIMA parameter (p, d, q, P, D, Q, S), then used grid search to find the best parameters.
Section 4.2.2: SARIMA model selection
Data split. To forecast Feb 12 11 am - Feb 12 10 pm without cheating, I first split the data into train, validation and test sets:
train_range = ('2016-02-01 00:00:00', '2016-02-12 10:59:59')
val_range = ('2016-02-19 11:00:00', '2016-02-19 22:59:59')
test_range = ('2016-02-12 11:00:00', '2016-02-12 22:59:59')
This gives a training set of 275 rows (about 12 days) and a test set of 12 rows (half a day). One concern, visible in the PACF in 4.2.1, is a strong autoregressive pattern at lags beyond 100, caused by weekly seasonality. With less than two weeks of training data we probably cannot capture this weekly trend as well as the daily one, but it doesn't hurt to try.
Grid search. I used the Akaike information criterion (AIC), together with the MSE on validation data, to judge model quality and pick the best parameters from the grid-search results (for both AIC and MSE, smaller is better). To reduce search time, I limited the sum p+d+q+P+D+Q to the range (4, 9).
import os
import pandas as pd
from IPython.display import clear_output
df_16 = load_data(month = '2016-02')
(1510722, 21)
import matplotlib.pyplot as plt
def agg_data_hour(df):
    time_df = df[['lpep_pickup_datetime']].copy()
    time_df['time'] = pd.to_datetime(time_df.lpep_pickup_datetime)
    trip_by_hour = time_df.groupby([pd.Grouper(key='time', freq='H')]).size().reset_index(name='count')
    return trip_by_hour
def plot_differencing(df):
    trip_by_hour = agg_data_hour(df)
    # Plot
    fig, axes = plt.subplots(3, 1, figsize=(12, 10), dpi=100, sharex=True)
    # Usual differencing
    axes[0].plot(trip_by_hour['count'], label='Original Series')
    axes[0].plot(trip_by_hour['count'].diff(1), label='Usual Differencing order 1')
    axes[0].set_title('Usual Differencing')
    axes[0].legend(loc='upper left', fontsize=10)
    # Seasonal differencing (daily, lag 24)
    axes[1].plot(trip_by_hour['count'], label='Original Series')
    axes[1].plot(trip_by_hour['count'].diff(24), label='Seasonal Differencing order 1')
    axes[1].set_title('Seasonal Differencing')
    axes[1].legend(loc='upper left', fontsize=10)
    # Both differencings combined
    axes[2].plot(trip_by_hour['count'], label='Original Series')
    axes[2].plot(trip_by_hour['count'].diff(1).diff(24), label='Both Differencing')
    axes[2].set_title('Both Differencing')
    axes[2].legend(loc='upper left', fontsize=10)
    plt.suptitle('Plot 4-1 Num of trips by hour', fontsize=16)
    plt.tight_layout()
    plt.show()
plot_differencing(df_16)
# Test whether the series is stationary.
# The null hypothesis of the ADF test is that the series has a unit root (non-stationary);
# p < 0.05 means the series is very likely stationary.
from statsmodels.tsa.stattools import adfuller
from numpy import log
trip_by_hour = agg_data_hour(df_16)
result = adfuller(trip_by_hour['count'].values, regression='ct')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
# First order of differencing is enough
result = adfuller(trip_by_hour['count'].diff(1).dropna(), regression = 'c')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
# d = 1 D = 0
ADF Statistic: -2.431626
p-value: 0.362945
ADF Statistic: -18.549451
p-value: 0.000000
import numpy as np, pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize':(12,9), 'figure.dpi':240})
# Original Series
fig, axes = plt.subplots(3, 2, sharex=False)
axes[0, 0].plot(trip_by_hour['count'].values); axes[0, 0].set_title('Original Series')
plot_acf(trip_by_hour['count'].values, ax=axes[1, 0], lags = 340)
plot_pacf(trip_by_hour['count'].values, ax=axes[2, 0], lags = 340)
# 1st Differencing
axes[0, 1].plot(trip_by_hour['count'].diff()); axes[0, 1].set_title('1st Order Differencing')
plot_acf(trip_by_hour['count'].diff().dropna(), ax=axes[1, 1], lags = 340)
plot_pacf(trip_by_hour['count'].diff().dropna(), ax=axes[2, 1], lags = 340)
plt.tight_layout()
plt.show()
import statsmodels.api as sm
res = sm.tsa.seasonal_decompose(trip_by_hour['count'], period=24)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
plt.axvline(x = 274)
plt.axvline(x = 286)
plt.show()
# we have a weekly trend here
import statsmodels.api as sm
res = sm.tsa.seasonal_decompose(trip_by_hour['count'], period=24*7)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
plt.axvline(x = 274)
plt.axvline(x = 286)
plt.show()
# if we take both day and week into count the seasonal trend is almost removed
# The period we need to predict looks very normal which makes the problem simpler
Though SARIMA doesn't support modeling multiple seasonalities, it doesn't hurt to give it a try.
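A common workaround for SARIMA's single seasonal period, not used in this analysis but worth sketching, is to encode the longer weekly cycle as Fourier-term exogenous regressors while keeping the daily cycle in the seasonal order. The helper name `fourier_terms` and the harmonic count `K=3` below are illustrative choices:

```python
import numpy as np

def fourier_terms(n_obs, period, K):
    """Build K sine/cosine pairs for a seasonal cycle of the given period."""
    t = np.arange(n_obs)
    cols = []
    for k in range(1, K + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Weekly cycle on hourly data: period = 24 * 7, three harmonic pairs.
weekly = fourier_terms(n_obs=275, period=24 * 7, K=3)
print(weekly.shape)  # (275, 6)
# These columns would be passed to SARIMAX via exog=weekly,
# alongside a daily seasonal_order such as (P, D, Q, 24).
```

With this design, the SARIMA seasonal order captures the 24-hour cycle and the smooth Fourier columns absorb the 168-hour weekly pattern.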
# Find AR order and SAR order from PACF
# p = 1-2
# P = 0-1
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
fig, axes = plt.subplots(1, 2, sharex=False)
axes[0].plot(trip_by_hour['count'].diff(1)); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(-1,1))
plot_pacf(trip_by_hour['count'].diff(1).dropna(), ax=axes[1], lags = 240)
plt.axhline(y = 0.2)
plt.axhline(y = -0.2)
plt.show()
# Find MA order and SMA order from ACF
# q = 1-2
# Q = 2-3
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
fig, axes = plt.subplots(1, 2, sharex=False)
axes[0].plot(trip_by_hour['count'].diff()); axes[0].set_title('1st Differencing')
#axes[1].set(ylim=(0,1.2))
plot_acf(trip_by_hour['count'].diff().dropna(), ax=axes[1], lags = 240)
plt.axhline(y = 0.2)
plt.axhline(y = -0.2)
plt.show()
def data_split(trip_by_hour, train_range=('2016-02-01 00:00:00', '2016-02-12 10:59:59'),
               val_range=('2016-02-19 11:00:00', '2016-02-19 22:59:59'),
               test_range=('2016-02-12 11:00:00', '2016-02-12 22:59:59')):
    tr_start, tr_end = train_range
    te_start, te_end = test_range
    va_start, va_end = val_range
    train_data = trip_by_hour.loc[(trip_by_hour.time >= tr_start) & (trip_by_hour.time <= tr_end), :].set_index('time')
    print(train_data.shape)
    val_data = trip_by_hour.loc[(trip_by_hour.time >= va_start) & (trip_by_hour.time <= va_end), :].set_index('time')
    print(val_data.shape)
    test_data = trip_by_hour.loc[(trip_by_hour.time >= te_start) & (trip_by_hour.time <= te_end), :].set_index('time')
    print(test_data.shape)
    return train_data, val_data, test_data
train_data, val_data, test_data = data_split(trip_by_hour)
(275, 1)
(12, 1)
(12, 1)
test_data.mean()
count    3435.5
dtype: float64
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from IPython.display import clear_output
import itertools
p = range(3)
d = range(2)
q = range(3)
P = range(3)
D = range(2)
Q = range(4)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 24) for x in list(itertools.product(P, D, Q))]
def grid_search_params(train_data, val_data, pdq, seasonal_pdq, max_sum=9, min_sum=4, max_iter=100):
    params = []
    params_s = []
    aics = []
    mses = []
    cnt = 0
    best_params = None
    current_aic = None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            if cnt >= max_iter:
                break
            order_sum = sum(param_seasonal[:3]) + sum(param)
            if min_sum <= order_sum <= max_sum:
                try:
                    mod = sm.tsa.statespace.SARIMAX(train_data['count'],
                                                    trend='c',
                                                    order=param,
                                                    seasonal_order=param_seasonal,
                                                    freq='H')
                    results = mod.fit()
                    pred = results.get_prediction(start=pd.to_datetime(val_data.index[0]),
                                                  end=pd.to_datetime(val_data.index[-1]))
                except Exception:
                    continue
                params.append(param)
                params_s.append(param_seasonal)
                aics.append(results.aic)
                mses.append(mean_squared_error(val_data, pred.predicted_mean))
                # track the lowest AIC seen so far
                if current_aic is None or results.aic < current_aic:
                    current_aic = results.aic
                    best_params = [param, param_seasonal]
                print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, current_aic))
                cnt += 1
    return params, params_s, aics, mses
params, params_s, aics, mses = grid_search_params(train_data, val_data, pdq, seasonal_pdq, max_sum = 9, min_sum = 3, max_iter = 100)
clear_output()
mod_selection_summary = pd.DataFrame(list(zip(params, params_s, aics, mses)), columns = ['pdq', 'PDQS', 'AIC', 'MSE'])
mod_selection_summary.sort_values('AIC').head()
| | pdq | PDQS | AIC | MSE |
|---|---|---|---|---|
| 83 | (0, 1, 1) | (0, 1, 2, 24) | 3411.547005 | 2.707297e+05 |
| 91 | (0, 1, 1) | (1, 1, 2, 24) | 3412.864229 | 1.377269e+06 |
| 84 | (0, 1, 1) | (0, 1, 3, 24) | 3413.050754 | 2.885561e+05 |
| 98 | (0, 1, 1) | (2, 1, 1, 24) | 3414.152866 | 3.381155e+05 |
| 90 | (0, 1, 1) | (1, 1, 1, 24) | 3414.454736 | 2.294201e+05 |
best_params = [(0,1,1), (1,1,2,24)]
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from IPython.display import clear_output
if best_params is None:
    # best params from R
    best_params = [(0, 1, 3), (2, 1, 3, 24)]
def best_model(best_params, train_data):
    sarimax = sm.tsa.statespace.SARIMAX(train_data,
                                        trend='ct',
                                        order=best_params[0],
                                        seasonal_order=best_params[1]).fit(disp=True)
    clear_output()
    print(best_params)
    print(sarimax.summary().tables[1])
    sarimax.plot_diagnostics(figsize=(18, 8))
    plt.show()
    return sarimax
sarimax = best_model(best_params, train_data)
[(0, 1, 1), (1, 1, 2, 24)]
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept 0.5365 33.962 0.016 0.987 -66.028 67.101
drift -0.0042 0.240 -0.017 0.986 -0.474 0.465
ma.L1 0.4355 0.048 9.112 0.000 0.342 0.529
ar.S.L24 -0.2033 0.278 -0.732 0.464 -0.748 0.341
ma.S.L24 -0.5066 4.178 -0.121 0.903 -8.695 7.682
ma.S.L48 -0.4875 1.922 -0.254 0.800 -4.254 3.279
sigma2 3.874e+04 1.56e+05 0.248 0.804 -2.67e+05 3.44e+05
==============================================================================
from sklearn.metrics import mean_absolute_error
def model_eval(sarimax, test_data, full_data=None):
    pred = sarimax.get_prediction(start=pd.to_datetime(test_data.index[0]),
                                  end=pd.to_datetime(test_data.index[-1]))
    pred_ci = pred.conf_int()
    result_df = pred.predicted_mean.to_frame().merge(test_data, left_index=True, right_index=True)
    result_df.columns = ['predicted_mean', 'observed']
    result_df = result_df.merge(pred_ci, left_index=True, right_index=True)
    fig, ax = plt.subplots(1, 1, sharex=False, figsize=(14, 4))
    ax.plot(result_df.index, result_df.observed, label='observed')
    ax.plot(result_df.index, result_df.predicted_mean, label='predicted')
    # shade the confidence interval
    ax.fill_between(result_df.index,
                    result_df['lower count'],
                    result_df['upper count'], color='k', alpha=.2)
    if full_data is not None:
        ax.plot(full_data.index, full_data['count'], label='observed full')
    ax.set_xlabel('Hour')
    ax.set_ylabel('Sum of Trips per hour')
    mse = mean_squared_error(result_df.observed, result_df.predicted_mean)
    mae = mean_absolute_error(result_df.observed, result_df.predicted_mean)
    plt.title('mae {} on test data'.format(str(round(mae, 2))))
    plt.legend()
    plt.show()
    return result_df
result_df = model_eval(sarimax, test_data, trip_by_hour.loc[(trip_by_hour.time>'2016-02-10')&(trip_by_hour.time<'2016-02-14')].set_index('time'))
Twelve days of training data is not enough to capture more complex seasonality (i.e. the weekly trend). Adding all of January would not be a good idea either, because the first few days of January follow more of a holiday schedule than the rest of the month. So I added the Feb 2015 data to the existing training set (2016-02-01 12 am to 2016-02-12 10 am).
df_15= load_data(month = '2015-02')
df_16 = load_data(month = '2016-02')
df_15 = df_15.reset_index().iloc[:, :-2]
df_15.columns = df_16.columns
df_15.head()
from datetime import datetime, timedelta
df_15_trip_by_hour = agg_data_hour(df_15)
# Shift Feb 2015 forward by 337 days so it ends on 2016-01-31,
# immediately before the Feb 2016 data starts.
df_15_trip_by_hour.time = df_15_trip_by_hour.time + timedelta(days=337)
df_16_trip_by_hour = agg_data_hour(df_16)
(1574830, 21)
(1510722, 21)
trip_by_hour = pd.concat([df_15_trip_by_hour, df_16_trip_by_hour], axis = 0, ignore_index = True)
train_data, val_data, test_data = data_split(trip_by_hour, train_range = ('2016-01-01 00:00:00','2016-02-12 10:59:59'))
params, params_s, aics, mses = grid_search_params(train_data, val_data, pdq, seasonal_pdq, max_sum = 9, min_sum = 3, max_iter = 100)
clear_output()
mod_selection_summary = pd.DataFrame(list(zip(params, params_s, aics, mses)), columns = ['pdq', 'PDQS', 'AIC', 'MSE'])
mod_selection_summary.sort_values('AIC').head(5)
| | pdq | PDQS | AIC | MSE |
|---|---|---|---|---|
| 99 | (0, 1, 1) | (2, 1, 2, 24) | 12799.285566 | 2.831572e+05 |
| 98 | (0, 1, 1) | (2, 1, 1, 24) | 12825.047927 | 1.026753e+06 |
| 84 | (0, 1, 1) | (0, 1, 3, 24) | 12830.741304 | 1.173885e+06 |
| 83 | (0, 1, 1) | (0, 1, 2, 24) | 12831.495818 | 1.353036e+06 |
| 91 | (0, 1, 1) | (1, 1, 2, 24) | 12832.258015 | 2.268699e+06 |
# 0,1,1,2,1,1,24 is the best considering both aic and mse
#best_params = (mod_selection_summary.sort_values('AIC').iloc[1]['pdq'], mod_selection_summary.sort_values('AIC').iloc[1]['PDQS'])
trip_by_hour = pd.concat([df_15_trip_by_hour, df_16_trip_by_hour], axis = 0, ignore_index = True)
train_data, val_data, test_data = data_split(trip_by_hour, train_range = ('2016-01-01 00:00:00','2016-02-12 10:59:59'))
best_params = [(0,1,1), (2,1,1,24)]
sarimax = best_model(best_params, train_data)
result_df_2 = model_eval(sarimax, test_data, trip_by_hour.loc[(trip_by_hour.time>'2016-02-10')&(trip_by_hour.time<'2016-02-14')].set_index('time'))
[(0, 1, 1), (2, 1, 1, 24)]
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept -5.9240 4.881 -1.214 0.225 -15.490 3.642
drift 0.0119 0.010 1.228 0.220 -0.007 0.031
ma.L1 0.3094 0.014 21.634 0.000 0.281 0.337
ar.S.L24 0.2036 0.024 8.416 0.000 0.156 0.251
ar.S.L48 -0.1352 0.031 -4.319 0.000 -0.197 -0.074
ma.S.L24 -0.9827 0.067 -14.668 0.000 -1.114 -0.851
sigma2 6.114e+04 2976.152 20.542 0.000 5.53e+04 6.7e+04
==============================================================================
The TBATS model (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is another option that can handle complex seasonality given sufficient training data; however, it takes much longer to train.
If I were a taxi driver, the evenings when most people get off work would be the busiest time of my day. Those trips are most likely short-distance, so I would probably just need to drive around the downtown area. For the rest of the day, big stations are where I should go to pick up new passengers.
A sad fact: most passengers don't tip at all :( People who travel in the early mornings tip the most.